# Cross-Modal Generation

## LLaDA-V

LLaDA-V is a vision-language model built on a diffusion language model, outperforming other diffusion-based multimodal large language models.

- Tags: Text-to-Image, Safetensors
- Publisher: GSAI-ML · Downloads: 174 · Likes: 8
## Qwen2.5-VL-7B-Instruct-Q8_0-GGUF

A GGUF-format conversion of Qwen2.5-VL-7B-Instruct, supporting multimodal tasks that combine image and text input.

- License: Apache-2.0
- Tags: Text-to-Image, English
- Publisher: cxtb · Downloads: 72 · Likes: 1
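Because the checkpoint is a GGUF file, it can be loaded with llama.cpp bindings rather than Transformers. A minimal sketch using llama-cpp-python, assuming a local path to the Q8_0 file (the filename below is a placeholder); note that image input additionally requires the model's multimodal projector file, which is distributed separately and is beyond this sketch:

```python
from llama_cpp import Llama

# Load the quantized checkpoint; the path/filename is a placeholder.
llm = Llama(
    model_path="qwen2.5-vl-7b-instruct-q8_0.gguf",
    n_ctx=8192,       # context window
    n_gpu_layers=-1,  # offload all layers to GPU if available
)

# Text-only chat completion shown here; vision input needs the
# separate multimodal projector (mmproj) and a matching chat handler.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what GGUF quantization does."}]
)
print(out["choices"][0]["message"]["content"])
```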
## Llama 3.2 90B Vision Instruct

Llama 3.2-Vision is a multimodal large language model developed by Meta that takes image and text input and produces text output, excelling at visual recognition, image reasoning, image captioning, and visual question answering.

- Tags: Image-to-Text, Transformers, Multilingual
- Publisher: meta-llama · Downloads: 15.44k · Likes: 337
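The model loads through Hugging Face Transformers' Mllama classes. A minimal sketch, assuming access to the gated meta-llama checkpoint, enough GPU memory for the 90B weights, and a local image file (the path is a placeholder):

```python
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image path
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False,
                   return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```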
## AA Chameleon 7B Base

A multimodal model supporting interleaved text-image input and output, built on the Chameleon 7B base model with image generation capabilities enhanced through the Align-Anything framework.

- Tags: Text-to-Image, Transformers, English
- Publisher: PKU-Alignment · Downloads: 105 · Likes: 8
## 4M-21 B

4M is an "any-to-any" foundation-model training framework that scales to many modalities through tokenization and masking; this is the B-size checkpoint.

- License: Other
- Tags: Multimodal Fusion, Safetensors
- Publisher: EPFL-VILAB · Downloads: 324 · Likes: 6
## 4M-21 L

The larger L-size checkpoint of 4M, the "any-to-any" foundation-model training framework extended to many modalities through tokenization and masking.

- License: Other
- Tags: Multimodal Fusion
- Publisher: EPFL-VILAB · Downloads: 49 · Likes: 3
## ldm-text2im-large-256

A text-to-image generation model based on latent diffusion, achieving efficient synthesis by running the diffusion process in a compressed latent space; this checkpoint generates 256×256 images.

- License: Apache-2.0
- Tags: Image Generation
- Publisher: CompVis · Downloads: 1,932 · Likes: 34
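The checkpoint is published as CompVis/ldm-text2im-large-256 on Hugging Face and loads directly through the diffusers pipeline API. A minimal sketch:

```python
from diffusers import DiffusionPipeline

# Load the LDM text-to-image pipeline (UNet, VAE, text encoder, scheduler).
pipe = DiffusionPipeline.from_pretrained("CompVis/ldm-text2im-large-256")
pipe = pipe.to("cuda")  # optional; also runs on CPU, just slower

prompt = "A painting of a squirrel eating a burger"
images = pipe([prompt], num_inference_steps=50, eta=0.3, guidance_scale=6).images
images[0].save("squirrel.png")
```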